Making the Web More Useful as a Source for Linguistic Corpora
Abstract
Both as a corpus and as a source of texts for corpora, the Web offers significant benefits in its virtually comprehensive coverage of major languages, content domains and written text types, yet its usefulness is limited by the generally unknown origin and reliability of online texts and by the sheer amount of “noise” on the Web. This paper describes and evaluates linguistic methods and computing tools to identify representative documents efficiently. To test these methods, a pilot corpus of 11,201 online documents in English was compiled. “Noise filtering” techniques based on n-grams helped eliminate both virtually identical and highly repetitive documents. Individual review of the remaining unique texts revealed that Web pages under 5 KB or over 200 KB tend to have a lower “signal to noise” ratio and can therefore be excluded a priori to reduce unproductive downloads. This paper also compares a selection of these Web texts (4,949 documents totaling 5.25 million tokens) with the written texts from the British National Corpus (BNC) to assess their similarity. The two corpora are broadly similar, and the important differences are outlined. With judicious selection, Web pages provide representative language samples, often prove more useful than off-the-shelf corpora for special information needs, and complement and verify data from traditional corpora.
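The filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives only the 5 KB and 200 KB size bounds and says the noise filtering was based on n-grams. Everything else here is an assumption made for illustration, including the function names, the use of word 5-grams, Jaccard overlap as the test for "virtually identical" documents, and the two thresholds (`max_repetition`, `dup_threshold`).

```python
MIN_BYTES = 5 * 1024     # lower size bound reported in the abstract
MAX_BYTES = 200 * 1024   # upper size bound reported in the abstract


def size_ok(n_bytes: int) -> bool:
    """Keep only pages between 5 KB and 200 KB."""
    return MIN_BYTES <= n_bytes <= MAX_BYTES


def word_ngrams(text: str, n: int = 5) -> list:
    """Return all word n-grams of a text (n=5 is an assumed value)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_ratio(text: str, n: int = 5) -> float:
    """Share of n-grams that repeat an earlier n-gram in the same text."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two n-gram sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_documents(docs: dict, n: int = 5,
                     max_repetition: float = 0.5,
                     dup_threshold: float = 0.9) -> list:
    """Return the ids of documents that pass all three filters.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    kept_ids = []
    kept_grams = []
    for doc_id, text in docs.items():
        # 1. Size filter: skip pages outside the 5 KB to 200 KB range.
        if not size_ok(len(text.encode("utf-8"))):
            continue
        # 2. Repetition filter: skip highly repetitive pages.
        if repetition_ratio(text, n) > max_repetition:
            continue
        # 3. Near-duplicate filter: skip pages that overlap almost
        #    completely with a page that was already kept.
        grams = set(word_ngrams(text, n))
        if any(jaccard(grams, seen) >= dup_threshold for seen in kept_grams):
            continue
        kept_ids.append(doc_id)
        kept_grams.append(grams)
    return kept_ids
```

In a real crawl the size check would normally run before download (for example against a reported content length), which is the saving the abstract refers to. The pairwise comparison in step 3 is quadratic in the number of kept documents; it is adequate for a pilot corpus of this size but would need hashing or indexing for larger collections.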
Similar resources
Not Just Bigger: Towards Better-Quality Web Corpora
For the acquisition of common-sense knowledge as well as a way to answer linguistic questions regarding actual language usage, the breadth and depth of the World Wide Web has been welcomed to supplement large text corpora (usually from newspapers) as a useful resource. While purists’ criticism of unbalanced composition or text quality is easily shrugged off as unconstructive, empirical resul...
Specialized Corpora from the Web and Term Extraction for Simultaneous Interpreters
There is no doubt that the Web is a mine of language data of unprecedented richness and ease of access (Kilgarriff and Grefenstette 2003). As more people use the Web for more tasks, it provides an increasingly representative machine-readable sample of interests and activity in the world (Henzinger and Lawrence 2004). Despite some drawbacks, the Web is an immense source of disposable corpora (Va...
A Comparative Analysis of Lexical Bundles in Journalistic Writing in English and Persian: A Contrastive Linguistic Perspective
This paper investigates the use of ‘lexical bundles’ in two broad corpora of journalistic writing. The aim of this study is to compare the use of lexical bundles in the two domains, one consisting of newspaper articles written in English and published in England and the other comprising newspaper articles written in Persian from Iranian publications. For this purpose, the frequency...
Draft WebCorp: providing a renewable data source for corpus linguists
The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and there remain problems for linguists if they seek access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the ret...
Publication date: 2003